Ch5 Dimensionality reduction techniques

Describe the work you have done this week and summarize your learning.

date()
## [1] "Thu Dec  1 18:04:45 2022"

Here we go again…

Libraries used for Ch5

Obs! You might need to install some packages first if you haven’t used them before (see install.packages()).

#install.packages(c("tidyverse","GGally","readr", "MASS", "corrplot", "psych"))

Load the needed R packages before getting to work:

#load required packages
library(tidyverse)
library(GGally)
#library(dplyr)
library(ggplot2)
library(corrplot)
#library(stringr)
library(psych) 
library(FactoMineR)

Data analysis (15p.)

1. Loading and describing the data ‘human’

# load the 'human' data
human <- read.csv("human_row_names.csv", row.names = 1)
# Alternative data url
# human <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", row.names = 1)

# explore the dataset
str(human);dim(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## [1] 155   8

Graphical overview of the data

# summary of the data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# Summary plot matrix with correlation and density plots
p1 <- ggpairs(human, mapping = aes(alpha=0.3), 
              title="Descriptives and Pearson correlation",
              lower = list(combo = wrap("facethist", bins = 20)))
p1

# compute the correlation matrix and visualize it with corrplot

#correaltion matrix
cor_matrix_h <- cor(human, method='spearman')

#p-values that correspond our correlation coefficients
cor_pmatrix_h <- corr.test(human, method = "spearman")$p

# visualize
corrplot(cor_matrix_h, type='upper', tl.col='black', col=COL2('BrBG'), addCoef.col='black', number.cex = 0.6, p.mat = cor_pmatrix_h, sig.level = 0.05)

All our variables are skewed. The significance values (p-values) seem to differ a little between the two plots. In our first plot p-values are determined as

  • “***” p < 0.001
  • “**” p < 0.01
  • “*” p < 0.05
  • “.” p < 0.10
  • ” ” otherwise

Our second plot shows all Spearman’s correlation coefficients in numbers, the color and sizes of the circles, and correlations with p < 0.05 are seen stroke out.

This difference is due to ggpairs() (1st plot) using Pearson’s correlations and us assigning corrplot() (2nd plot) with Spearman’s correlation derived data. Since our variables seemed skewed, I prefer using Spearman’s correlation for bivariate correlation analysis.

There seems to be several strong and statistically significant correlations:

  • Higher maternal mortality (Mat.Mor) is correlated with:
    • lower ratio of females to males with at least secondary education (Edu2.FM)
    • lower expected years of schooling (Edu_Exp)
    • lower life expectancy (Life_Exp)
    • lower gross national income per capita (GNI)
    • higher adolescent birth rate (Ado.Birth)
  • Higher adolescent birth rate is correlated with:
    • lower ratio of females to males with at least secondary education
    • lower expected years of schooling
    • lower life expectancy
    • lower GNI
  • Higher GNI seems to associated with:
    • higher ratio of females to males with at least secondary education
    • higher expected years of schooling
    • higher life expectancy
  • It seems that ratio of females to males in the labour force (Labo.FM) nor percentage of female representatives in parliament (Parli.F) are not statistically significantly associated with any of the other variables.

2. Principal component analysis with raw data

I have had a hard time understanding PCA, but our course material had a very clear intro to this: (Quoted directly from our course material:) *“Principal Component Analysis (PCA) can be performed by two sightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).

There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.

Both methods quite literally decompose a data matrix into a product of smaller matrices, which let’s us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.”*

We’ll first run a PCA on raw, non-standardized data (part 2) and then on standardized data (part 3). Finally, we’ll interpret the results on part 4 of this assignment.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# create and print out a summary of pca_human (= variability captured by the principal components)
s <- summary(pca_human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 1)*100
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

All the data is clustered to the upper right corner of the plot. PC1 explains 100% of the total variance of the dataset. Because of a vector parallel to PC1, we know that GNI contributes mostly and only to PC1. This means that knowing the position of an observation in relation to PC1, it would be possible to have an accurate view of how the observation is related to other observations in our sample and basically we would only need GNI-information to determine that. Since the data is not scaled the high influence of GNI to PC1 is most probably due to GNI having the mos variance of the variables in our dataset!

3. Principal component analysis with standardized data

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
# create and print out a summary of pca_human (= variability captured by the principal components)
s_hs <- summary(pca_human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentanges of variance captured by each PC
pca_pr_hs <- round(1*s_hs$importance[2, ], digits = 1)*100
pca_pr_hs
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
##  50  20  10  10  10   0   0   0
# create object pc_lab to be used as axis labels
pc_lab_hs <- paste0(names(pca_pr_hs), " (", pca_pr_hs, "%)")

# draw a biplot
biplot(pca_human_std, choices=1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_hs[1], ylab = pc_lab_hs[2])

The plot with standardized data look very different from the earlier biplot made with unstandardized data. Also, the variability captured by the principal components seems to be distributed more realistically between PCs. The results between unstandardized and standardized data PCA are different since PCA requires data scaling for variable comparison.

For the PCA with standardized data:

  • PC1 explains 54 % of variance in the dataset.
  • PC2 explains 16 % of variance in the dataset.
  • PC3 explains 10 % of variance in the dataset.
  • PC4 explains 8 % of variance in the dataset.
  • PC5 explains 5 % of variance in the dataset.
  • PC6 explains 4 % of variance in the dataset.
  • PC7 explains 3 % of variance in the dataset.
  • PC8 explains 1 % of variance in the dataset.

In summary, the data should be scaled before PCA for the analysis to be trustworthy.

4. Interpretations

I’ll elaborate on the results of the PCA with standardized data:

PC1 explains the most (53,6%) of the variability in the dataset, and the variables that mostly affect it are:

  • Expected years of schooling (positive effect)
  • Life expectancy at birth (positive effect)
  • GNI (positive effect)
  • Ratio of females to males with at least secondary education (positive effect)
  • Maternal mortality (negative effect)
  • Adolescent birth rate (negative effect)

The four variables with a positive effect on PC1 are positively correlated with each other and the ones with a negative effect (two) are correlated positively with each other and negatively with the prior four.

PC2 explains the second most (16,2%) of the variability in the dataset, and the variables that mostly affect it are:

  • Ratio of females to males in the labour force (positive effect)
  • Percentage of female representatives in parliament (positive effect)

These two variables are positively correlated with each other and not with any of the variables affecting PC1 (since the arrows are perpendicular to the others).

5. Analysis of tea dataset

According to our assignment the tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

We’ll first load the data (wrangled by our course professor K.V.) and then start exploring and analyzing it!

# load the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Structure and dimensions of the data
str(tea);dim(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300  36
# View data
View(tea)

Let’s visualize the data:

# summary of the data
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# Summary plot matrix grouped by sex and age quartile

tea_long <- tea %>% 
  pivot_longer(cols=-c(sex,age_Q, age), names_to = "variable",values_to = 'value') %>%
  group_by(sex,age_Q,variable,value) %>%
  summarise(n=n())
tea_long
## # A tibble: 764 × 5
## # Groups:   sex, age_Q, variable [330]
##    sex   age_Q variable         value                   n
##    <fct> <fct> <chr>            <fct>               <int>
##  1 F     +60   always           always                  2
##  2 F     +60   always           Not.always             21
##  3 F     +60   breakfast        breakfast               8
##  4 F     +60   breakfast        Not.breakfast          15
##  5 F     +60   dinner           dinner                  1
##  6 F     +60   dinner           Not.dinner             22
##  7 F     +60   diuretic         diuretic               14
##  8 F     +60   diuretic         Not.diuretic            9
##  9 F     +60   effect.on.health effect on health        6
## 10 F     +60   effect.on.health No.effect on health    17
## # … with 754 more rows
# Barplots of all variables

# visualize the dataset
pivot_longer(tea, cols = -c(age)) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Unfortunately there is so much data we don’t get good looking plots. We’ll have to choose some of the most interesting variables to plot:

# column names to keep in the dataset
keep_columns <- c("age_Q", "sex", "Tea", "How", "price", "SPC", "Sport", "frequency", "relaxing", "healthy", "sophisticated")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##    age_Q    sex            Tea         How                  price    
##  +60  :38   F:178   black    : 74   alone:195   p_branded      : 95  
##  15-24:92   M:122   Earl Grey:193   lemon: 33   p_cheap        :  7  
##  25-34:69           green    : 33   milk : 63   p_private label: 21  
##  35-44:40                           other:  9   p_unknown      : 12  
##  45-59:61                                       p_upscale      : 53  
##                                                 p_variable     :112  
##                                                                      
##            SPC               Sport           frequency          relaxing  
##  employee    :59   Not.sportsman:121   +2/day     :127   No.relaxing:113  
##  middle      :40   sportsman    :179   1 to 2/week: 44   relaxing   :187  
##  non-worker  :64                       1/day      : 95                    
##  other worker:20                       3 to 6/week: 34                    
##  senior      :35                                                          
##  student     :70                                                          
##  workman     :12                                                          
##         healthy              sophisticated
##  healthy    :210   Not.sophisticated: 85  
##  Not.healthy: 90   sophisticated    :215  
##                                           
##                                           
##                                           
##                                           
## 
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free", ncol=6) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Again, quoted a very insightful part of our course material: Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data and it is an extension of Correspondence analysis (CA). MCA can be used to detect patterns or structure in the data as well as in dimension reduction.

Since there are multiple variables. We’ll choose and focus on only the ones we previously chose: “age_Q”, “sex”, “Tea”, “How”, “price”, “SPC”, “Sport”, “frequency”, “relaxing”, “healthy”, “sophisticated” and run a MCA.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# summary of the model
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.205   0.179   0.138   0.129   0.126   0.125   0.114
## % of var.              8.059   7.023   5.404   5.073   4.946   4.910   4.486
## Cumulative % of var.   8.059  15.082  20.485  25.558  30.504  35.414  39.899
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.108   0.104   0.102   0.099   0.096   0.090   0.089
## % of var.              4.251   4.068   3.994   3.889   3.790   3.521   3.508
## Cumulative % of var.  44.150  48.218  52.212  56.101  59.891  63.412  66.920
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.085   0.082   0.076   0.074   0.071   0.068   0.065
## % of var.              3.353   3.209   3.003   2.896   2.809   2.667   2.536
## Cumulative % of var.  70.273  73.482  76.485  79.382  82.190  84.857  87.393
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.061   0.056   0.055   0.053   0.044   0.028   0.024
## % of var.              2.406   2.208   2.160   2.071   1.720   1.091   0.952
## Cumulative % of var.  89.798  92.007  94.166  96.237  97.957  99.048 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         |  0.205  0.068  0.009 |  0.966  1.740  0.207 |  0.345  0.289  0.027
## 2         |  0.298  0.145  0.036 |  0.576  0.618  0.135 |  0.099  0.024  0.004
## 3         | -0.118  0.023  0.006 |  0.135  0.034  0.007 |  0.034  0.003  0.000
## 4         | -0.497  0.401  0.183 | -0.191  0.068  0.027 | -0.017  0.001  0.000
## 5         | -0.237  0.091  0.031 |  0.274  0.140  0.042 |  0.078  0.015  0.003
## 6         | -0.800  1.040  0.253 |  0.002  0.000  0.000 |  0.364  0.322  0.053
## 7         | -0.020  0.001  0.000 |  0.542  0.549  0.107 |  0.441  0.472  0.070
## 8         |  0.197  0.063  0.014 |  0.231  0.099  0.019 |  0.337  0.275  0.041
## 9         |  0.195  0.062  0.014 |  0.569  0.603  0.119 |  0.617  0.922  0.140
## 10        |  0.496  0.400  0.090 |  0.323  0.195  0.038 |  0.553  0.742  0.112
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## +60       |   1.564  13.735   0.355  10.301 |  -1.327  11.337   0.255  -8.737 |
## 15-24     |  -1.100  16.446   0.535 -12.650 |  -0.578   5.203   0.148  -6.643 |
## 25-34     |  -0.125   0.159   0.005  -1.182 |   0.668   5.214   0.133   6.310 |
## 35-44     |   0.543   1.741   0.045   3.682 |   0.521   1.842   0.042   3.535 |
## 45-59     |   0.470   1.992   0.056   4.107 |   0.601   3.730   0.092   5.247 |
## F         |  -0.088   0.203   0.011  -1.836 |  -0.402   4.872   0.236  -8.393 |
## M         |   0.128   0.297   0.011   1.836 |   0.586   7.108   0.236   8.393 |
## black     |   0.708   5.484   0.164   7.008 |  -0.092   0.107   0.003  -0.913 |
## Earl Grey |  -0.375   4.017   0.254  -8.716 |   0.000   0.000   0.000  -0.010 |
## green     |   0.607   1.795   0.046   3.689 |   0.209   0.245   0.005   1.272 |
##             Dim.3     ctr    cos2  v.test  
## +60         0.600   3.011   0.052   3.950 |
## 15-24       0.209   0.882   0.019   2.399 |
## 25-34      -0.358   1.945   0.038  -3.380 |
## 35-44      -0.020   0.003   0.000  -0.134 |
## 45-59      -0.271   0.984   0.019  -2.364 |
## F          -0.247   2.393   0.089  -5.159 |
## M           0.360   3.491   0.089   5.159 |
## black       0.673   7.394   0.149   6.664 |
## Earl Grey  -0.130   0.719   0.030  -3.020 |
## green      -0.750   4.086   0.069  -4.557 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## age_Q     | 0.769 0.537 0.103 |
## sex       | 0.011 0.236 0.089 |
## Tea       | 0.255 0.007 0.185 |
## How       | 0.137 0.083 0.061 |
## price     | 0.106 0.115 0.351 |
## SPC       | 0.697 0.618 0.328 |
## Sport     | 0.140 0.061 0.104 |
## frequency | 0.019 0.135 0.104 |
## relaxing  | 0.069 0.120 0.034 |
## healthy   | 0.033 0.030 0.081 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali") #individuals (ind) invisible = plots variables

plot(mca, invisible=c("var"), graph.type = "classic") #variables (var) invisible = plots individuals

Our first and fourth plots (MCA factor map) show variable categories with similar profiles grouped together;

  • ‘non-workers’, ‘+60’ year olds and drinking tea in an ‘other’ way (i.e. not alone, not with lemon nor milk) have similar profiles. Also ‘student’ and ‘15-24’ year old seem to have similar profiles.

  • ‘non-workers’, ‘+60’ year olds and ‘other’ profiles seem to be negatively correlated with ‘student’ and ‘15-24’ yo categories (opposite sides of plot origin), and especially with ‘25-34’ year old and ‘workman’.

  • ‘non-workers’, ‘+60’ year olds, ‘other’ kind of drinkers, ‘student’, ‘15-24’ year olds, ‘senior’, and ‘p_unknown’ (tea price unknown) seem to be the variable categories that are most represented in our data (tehy are the farthers of the origo).

From the third plot (correlation between variables and principal dimensions) we can see that

  • SPC (socio-professional category) and age are correlated with each other
  • of all the variables SPC and age are the least correlated with the first dimensions (dim1 and 2) which explain the most (15%) of the variance of the chosen data.

To assess whether the variable categories differ significantly from each other we’ll draw confidence ellipses, which is possible with ‘FactoMineR’:

plotellipses(mca)

The plots are hard to read, so we’ll focus on specific variables 4 at a time.

plotellipses(mca, keepvar = c("sex", "Tea", "age_Q", "How"))

plotellipses(mca, keepvar = c("Sport", "price", "frequency", "SPC"))

plotellipses(mca, keepvar = c("relaxing", "healthy", "sophisticated"))

This shows us:

  • categories ‘sex’ and ‘How’ seem to be different from each other.
  • For the ‘age_Q’ variable, ‘15-24’ and ‘+60’ categories seem different from the other categories.
  • ‘Earl Grey’ seems to differ from the other tea varieties
  • Drinking ‘Other’ way tea seems to differ from drinking alone, with lemon or with milk.
  • ‘students’ on ‘non-workers’ are different and different from the other SPC-categories.
  • Doing sports or not differs.
  • Frequency categories don’t seem to differ significantly from each other.
  • Tea with unknown price ‘p_unknown’ seems to differ from the other price categories.
  • All three product perception variables ‘healthy’, ‘relaxing’ and ‘sophisticated’ seem to have differing categories.

✓ ✓ ✓ ✓ ✓ Ch5 done!